Method and system for assessing relevant properties of work contexts for use by information services

ABSTRACT

An information retrieval system for automatically retrieving information related to the context of an active task being manipulated by a user. The system observes the operation of the active task and user interactions, and utilizes predetermined criteria to generate a context representation that is relevant to the context of the active task. The information retrieval system then processes the context representation to generate queries or search terms for conducting an information search. The information retrieval system determines the relevance of a word to the context by utilizing an adaptive weighting system. The information retrieval system assigns varying weights to different attributes of a word and calculates an accumulated weight of the word by accumulating all weights assigned to the word. The attributes may include word size, style, location of the word, etc. The system then ranks the importance of words based on their respective accumulated weights, and chooses words that rank within a predetermined number from the top to form search terms to conduct an information search using various data sources.

FIELD OF THE DISCLOSURE

The present disclosure generally relates to an automatic method and system for forming queries to retrieve information, and more specifically, to a method and system to automatically generate keywords, phrases and other entities representing the content and/or context of an active task being manipulated by a user and to retrieve information based on the representation. The representation of work context may be used to support a variety of information services.

BACKGROUND OF THE DISCLOSURE

Information retrieval systems, such as databases or search engines, allow users to retrieve information related to a specific subject by using one or more keywords that may be related to the specific subject. For example, a legal search service called Lexis® provided by the LexisNexis Group is widely used in the legal field to search for cases, journal articles, treaties, as well as other publications that are related to a specific topic or issue. Another information retrieval system, Google®, provided by Google, Inc., is a search engine commonly employed by internet users to search for web sites or online documents that are related to a specific subject matter.

In order to search for and retrieve documents related to a specific subject matter, users need to formulate a query that typically comprises a set of keywords, phrases, symbols, commands, and/or other entities that are considered to be relevant to the subject matter or possibly contained in the documents relating to the subject matter. This type of information retrieval system poses problems to users because the users need to be familiar with the proper format for inputting queries into such systems. In addition, users need to have a basic understanding of the subject matter to be searched as well as of properties of the language used to describe that subject in order to formulate a proper query to conduct the search.

Some information retrieval systems provide assistance on query formation. For example, a website www.ask.com provides a search function called Ask Jeeves that allows users to input their questions in natural language. The system will extract keywords from the questions and conduct a search accordingly. Lexis® also provides a similar function allowing users to input search terms in natural language, either as a question or a statement. The system then extracts keywords from such natural language inputs to search for information related to the keywords.

Although these tools provide basic assistance on query formation during information search and retrieval, such tools cannot function effectively in more realistic work environments in which the context of the query or question plays a paramount role. For example, consider the following search scenarios related to the same keyword “caterpillar”:

Scenario 1:

A biology student writing a term paper on animal development. In this case, the information search should be related to metamorphosis, the process by which the caterpillar becomes a butterfly.

Scenario 2:

A contractor working on a construction plan for a new building. The contractor is most likely referring to Caterpillar, Inc., a major manufacturer of construction equipment.

Scenario 3:

A grade-school student writing a book report on Lewis Carroll's book, Alice's Adventures in Wonderland. In this case, information retrieved should preferably be related to the character in the book, chapter excerpts, and pictures that the student could include in her paper.

These scenarios illustrate various problems associated with conventional information retrieval systems. The first problem is that conventional information retrieval systems do not consider relevance of active goals in searching for information. The active goals of the user contribute significantly to the interpretation of the search terms and to the criteria for judging a resource as being relevant to the search terms. Typically, these goals are not fully expressed by users in forming their queries when using conventional information retrieval systems.

The second problem is that conventional information retrieval systems are subject to word-sense ambiguity. For example, the word “caterpillar” in scenario 1 should be treated differently from that in scenario 2. The context of the request provides a clear choice of word sense between the insect and the company. Conventional information retrieval systems cannot distinguish the subtle differences unless additional keywords or information are provided by the user.

The third problem is that conventional information retrieval systems fail to consider audience appropriateness when searching and retrieving information based on keywords or queries provided by the user. In addition to the keywords provided by the user, attributes related to the user in each of the above scenarios should also influence the choice of results. Sources appropriate for an advanced biology student will likely not be appropriate for a student in grade school.

Moreover, when using conventional information retrieval systems, users often are unable to provide sufficient information in their queries. Studies show that on average, users' queries tend to be two to three words long. Needless to say, a two-word query most likely does not contain enough information to discern the active goals of the user, or even the appropriate senses of the words in the query.

Furthermore, even if the user has sufficient knowledge to formulate workable queries to conduct a search, the user must be aware of the variety of available resources, decide where to find them, and must know how to use different information retrieval systems correctly, including details such as those concerning special operators like “and,” “or,” or “+” that are used differently in different information retrieval systems.

Therefore, there is a need to provide an automatic query formation system to assist users in retrieving information related to their active goals without their intervention. There is another need for an information retrieval system to consider the context of words or phrases when conducting an information search and retrieval. There is also a need to improve the performance of an information retrieval system by refining queries based on various attributes related to the users. An additional need exists to automate the information search and retrieval process by forming queries in proper format for conducting information search in different information sources.

SUMMARY OF THE DISCLOSURE

An exemplary information retrieval system addresses the above-noted problems and needs. The exemplary information retrieval system dynamically observes an active task being manipulated by a user, and collects information regarding the active task. The system automatically generates keywords, phrases, and/or representations that are relevant to the context of the task being manipulated by the user based on the observation and the collected information, and a variety of other attributes and/or information, such as attributes concerning the user, the software application being employed, the state of the active task, as well as other considerations or additional information, etc. The system then proactively retrieves information or documents, or references to other relevant resources, e.g., contact information for people who may be assisting or related to the tasks, from various information resources by submitting properly formulated queries based on the search terms. It then analyzes and organizes the search results for presentation to the user.

In one aspect, the exemplary information retrieval system is implemented as a software application executed by a data processing system, such as a computer, PDA (personal digital assistant), mobile phone, or the like, and monitors the operation of other software applications, such as Word, Internet Explorer, Netscape, etc. The data processing system has access to information repositories or information sources, such as databases and/or internet search engines or the like. The information retrieval system monitors activities of active tasks being manipulated by the user and collects information relevant to the activities. The information may be texts in the active task, fonts, styles of texts, locations, and/or other attributes of the active task, as well as user actions in software applications, attributes of the user, and so on.

The information retrieval system utilizes predetermined criteria to automatically select keywords, phrases, and other entities or information useful in search that are relevant to the context of the active task being manipulated by the user based on the collected information. The information retrieval system then processes the keywords, phrases, and other entities, etc. to generate queries or search terms for conducting an information search on various information resources. The information retrieval system then analyzes and organizes the search results and makes them accessible to the user.

Various criteria are used to generate and/or select keywords that are relevant to the context of the active task being manipulated by the user. In one aspect, the information retrieval system excludes words that have at least one of the following attributes: having fewer than n letters, wherein n is a tunable parameter, typically 2 or 3 (with certain exceptions, such as terms stipulated in a list, and/or being part of a recognized entity, such as a trade name); containing all numbers (with certain exceptions, such as terms stipulated in a list, and/or being part of a recognized entity, such as a street address); and being a member of a stop list, which may be in part predetermined, and in part determined by properties of the active task (including, e.g., portions of the URL or other identifier of the document or documents relating to the active task), properties of the user, and properties of the data sources to be searched.

In one embodiment, the information retrieval system may access information related to exceptions to the exclusion criteria so that certain text items are preserved even if they meet the predetermined criteria. The information retrieval system may access a file containing an exception list that includes text items that carry contextual significance and are therefore exempted from the application of the exclusion criteria. For example, certain text terms, such as “A1 steak sauce,” “i2 Technologies,” etc. would be preserved if “A1” and “i2” are part of the exception list. The information retrieval system may also exempt recognized constituent items from the application of the exclusion criteria. For instance, if the information retrieval system recognizes a text item containing all numbers as part of an address, the information retrieval system would not exclude that text item.
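By way of illustration only, the exclusion criteria and the exception list described above might be sketched as follows. The function names, the tokenizing pattern, and the default value of n are assumptions made for this sketch and are not taken from the disclosure; recognition of entities such as addresses is omitted.

```python
import re

def keep_term(term, min_len=3, stop_list=frozenset(), exceptions=frozenset()):
    """Return True if the term survives the exclusion criteria."""
    t = term.lower()
    if t in exceptions:      # exception list overrides every exclusion rule
        return True
    if len(t) < min_len:     # fewer than n characters (n is tunable)
        return False
    if t.isdigit():          # consists entirely of digits
        return False
    if t in stop_list:       # member of the stop list
        return False
    return True

def filter_terms(text, **criteria):
    """Split text into alphanumeric tokens and drop the excluded ones."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens if keep_term(t, **criteria)]

kept = filter_terms(
    "I put A1 on the steak in 2004",
    min_len=3,
    stop_list={"the"},
    exceptions={"a1"},
)
```

In this sketch "A1" is preserved only because it appears in the exception list; without that entry it would be dropped by the length rule.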

In another aspect, the information retrieval system determines the relevance of a word, phrase, symbol, other entities, etc. to the context of an active task being manipulated by the user by utilizing an adaptive weighting system. The information retrieval system assigns varying weights to different attributes of a word or text, and calculates an accumulated weight of the word or text by accumulating all weights assigned to the word or text. The attributes may include word size, style, location of the word, etc. A text appearing in a normal style may have its weight incremented by a tunable parameter p1, and a text occurring in emphasized form may have its weight incremented by a tunable parameter p2, wherein p1 and p2 are different weights. The system may assign a heavier weight to a text that appears in a specific portion of the document, in an active window visible to the user, or in a portion selected by the user. The system may increase the weight of a word or text highlighted by the user. The system may also increase the weight of a word or text that is displayed in an emphasized form, such as bold, italic, larger fonts, and the like.

The system may rank the importance of words or texts based on their respective accumulated weights. The information retrieval system may choose the top ranked texts or words, such as the top 20, to form queries to conduct an information search and retrieval from various data sources or to serve as a representation of the user's work context for use in a variety of information services. The data sources may be located either locally or remotely, coupled to the system via a data transmission network.
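The accumulation and ranking of weights described above can be illustrated with the following minimal sketch. The numeric values chosen for p1 and p2, the bonus for user-selected text, and the tuple layout of each occurrence are hypothetical and serve only to make the example concrete.

```python
from collections import defaultdict

# Hypothetical values: p1 for normal style, p2 for emphasized style.
STYLE_WEIGHTS = {"normal": 1.0, "emphasized": 3.0}
SELECTED_BONUS = 2.0  # assumed extra weight for text selected by the user

def rank_terms(occurrences, top_n=20):
    """Accumulate a weight per word and return the top_n words.

    occurrences: iterable of (word, style, selected) tuples, one per
    appearance of a word in the active document.
    """
    totals = defaultdict(float)
    for word, style, selected in occurrences:
        key = word.lower()
        totals[key] += STYLE_WEIGHTS.get(style, 0.0)
        if selected:
            totals[key] += SELECTED_BONUS
    # Highest accumulated weight first; ties broken alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

occurrences = [
    ("caterpillar", "emphasized", False),
    ("butterfly", "normal", False),
    ("caterpillar", "normal", False),
    ("larva", "normal", True),
]
top_two = rank_terms(occurrences, top_n=2)
```

Here "caterpillar" accumulates 4.0, "larva" 3.0, and "butterfly" 1.0, so the first two would be chosen as search terms.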

Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only exemplary embodiments of the present disclosure are shown and described, simply by way of illustration of the best mode contemplated for carrying out the present disclosure. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1A is a block diagram showing system architecture of an exemplary information retrieval system according to this disclosure.

FIG. 1B shows a more detailed architecture of an exemplary information retrieval system.

FIG. 2 depicts a block diagram of an exemplary data processing system that can be used to implement the information retrieval system according to this disclosure.

DETAILED DESCRIPTION OF THE DISCLOSURE

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present disclosure.

System Architecture

FIG. 1A is a block diagram of system architecture of an exemplary information retrieval system 100 according to this disclosure. A user utilizes a data processing system, such as a computer, to execute application software 105. The information retrieval system 100 interacts with the application software 105 to detect and/or observe the operation and activities of an active task that the user is manipulating, and determines the user's potential information needs from this and other information, e.g., attributes of the user, application software, type of active tasks, potential data sources, the state of the active task, the genre or type of the active documents or tasks, as well as other considerations or additional information, etc. Based on the detection and/or observation, the information retrieval system generates a set of queries that are relevant to the content of the active task being manipulated by the user, plus additional information as just mentioned. The information retrieval system 100 then dynamically retrieves information from various information sources based on these queries.

In one embodiment, the information retrieval system 100 is able to access the document that the user is manipulating, and use the actions that the user performs on the document along with other information, such as attributes of the user (e.g., his or her role in an organization), and attributes of the task being carried out by the user (e.g., the current step in a sequence of steps of the task), to determine the user's potential information needs. The information retrieval system 100 then generates queries to initiate an information search in information sources 108. The queries generated by the information retrieval system 100 may be formulated into various formats suitable to the respective information source, and their contents may reflect attributes of the information sources 108. The information retrieved from the information sources 108 may be stored in the same data processing system executing the application software 105, or be kept in one or more repositories that can be accessed by the user.

Since the information retrieval system 100 generates queries based on the activities of the application software 105 without user intervention, the user does not have to formulate or generate information search queries in a separate search operation.

The information retrieval system 100 may be implemented as a software program configured to be executed by the data processing system. The information retrieval system 100 and the application software 105 may be executed by a data processing system, such as a computer, PDA (personal digital assistant), mobile phone, or the like. The application software 105 may be any type of software application that a user may employ, such as word processors, web browsers, e-mail applications, etc., on different operating system platforms such as Windows XP®, Palm OS®, Linux®, Mac OS X®, PocketPC® and the like.

The information retrieval system 100 may be part of the application software 105, or an add-in program or software object corresponding to the application software 105 that may be employed by the user. According to one embodiment, whenever the user launches application software 105, the information retrieval system 100 corresponding to the launched application software 105 is also launched. The information retrieval system 100 then monitors the operation of the application software 105 and generates queries based on the observation. According to another embodiment, the information retrieval system 100 is launched manually by the user, such as by pushing a specific button or clicking a specific icon.

The application software 105 may be any application software that is utilized by a user to perform a task or tasks, such as Microsoft Word®, a word processing package, Microsoft Internet Explorer® and Netscape Navigator®, both Web browsers, Microsoft Outlook®, an email client, and Microsoft Power Point®, a presentation package, and so on.

The information source 108 may be located in the same data processing system that executes the information retrieval system 100, or in a different machine or machines that are coupled to the data processing system executing the information retrieval system 100 via a data transmission network, such as an intranet, the internet, a LAN (local area network), or the like, by wire or wirelessly or both.

The information source 108 may include a compilation of searchable data, such as one or more databases, and/or any hardware that performs information search and retrieval, or a combination of both. For example, the information source 108 may be servers that execute software programs for search engines, such as Google, or commercial database services such as LexisNexis or WestLaw. The information source 108 may further connect to other information repositories and/or databases.

FIG. 1B shows a more detailed architecture of the information retrieval system 100. The information retrieval system 100 employs an adapter-based architecture, in which both interfaces to application software 105 and to information sources 108 are encapsulated as software components called application adapters 151, and information source adapters 153, respectively. The application adapter 151 and information source adapter 153 are used to interface with various application software 105 and information source 108, respectively, such that the information processing component 152 can communicate with the application software 105 and/or information source 108 properly. The information retrieval system 100 utilizes a component called information processing component 152 to conduct core calculation and data analysis.

Each application adapter 151 and information source adapter 153 is encapsulated. Adapters are akin to software plug-ins, in that both add additional functionality using a predefined interface. The information retrieval system 100 may have several application adapters 151, which are used to gain access to the application software's internal representation of a document or active tasks, and the application-level events generated by user interactions with the application software 105.

The application adapters 151 may extract information related to text of the user's current document or active task, including words and/or graphics and/or objects used in the document, attributes of the words, such as size, style and the like, as well as properties of the documents as a whole, such as its genre, type or subject matter. The application adapters 151 may also obtain information related to the operation status of the application software 105, such as the user's focus of attention within a document, the user's action performed on the document, etc. For example, the application adapter 151 may obtain information related to the portion or page of a document that is being displayed to the user, or information related to regions of a document that the user has selected and words contained in the selected regions. The information obtained by the application adapter 151 is then passed to the information processing component 152 for further processing.

The information retrieval system 100 may have different application adapters 151 and information source adapters 153 corresponding to different application software 105 and different information sources 108, respectively. For example, various application adapters 151 can be provided to different application software 105, such as Microsoft Internet Explorer® and Netscape Navigator®, both Web browsers, Microsoft Outlook®, an email client, and Microsoft Power Point®, a presentation package, that are configured to be executed on different operating system platforms, such as Solaris®, Linux®, Windows® operating systems, and the like.

The application adapter 151 and information source adapter 153 corresponding to different application software 105 and information source 108 may be packaged individually such that the information processing component 152 may access each application adapter 151 and/or information source adapter 153 separately. In addition, application adapter 151 and information source adapter 153 corresponding to new application software 105 and information source 108 may be developed and added to the information retrieval system 100 from time to time. Furthermore, the application adapter 151, information source adapter 153 and the information processing component 152 do not have to reside in the same data processing system. Rather, components can reside in different data processing systems. The system may retrieve or remotely access and utilize the information processing component 152 as well as application adapters 151 and/or the information source adapter 153 whenever necessary.

This architecture allows the information retrieval system 100 to adapt to changes without requiring a full-scale redeployment. Information source adapters 153 can be updated when information sources 108 are changed. New information source adapters 153 can be written to encapsulate new information service offerings. In addition, as new application software 105 is available, corresponding application adapters 151 can be distributed, enabling the information retrieval system 100 to provide users contextually-relevant information in the new application software 105.

The architecture also provides abstractions for managing communications between the information processing component 152 and application software 105, and between the information processing component 152 and information sources 108. In addition, it specifies a mechanism for managing communications among software components responsible for content analysis and query generation. This architecture also allows the information retrieval system 100 to adapt to different application software 105 and/or information source 108 relatively easily. In each case, the software components may reside locally or remotely.

In one embodiment, information source adapters 153 are written in an XML-based interpreted language. Application adapters 151 may be written in the language most convenient for accessing the internal state of the application software 105. The application adapters 151 can communicate with the information processing component 152 via a standard programming interface, such as an application programming interface (API) or using the application software's internal scripting language. For application software that does not readily support access to internal state through one of those mechanisms, application adapters 151 can be written, for example, by trapping low-level operating system library calls that draw text to the screen.

The application adapter 151 may obtain information related to the document that the user is manipulating using the application software's programming API (Application Programming Interface) or OS-level APIs. For example, for Microsoft Word®, the “Normal style” of the current document is determined by inspecting the object model's style objects. For each paragraph in the document, the style is extracted by inspecting the properties exposed by the object model for the range. If the weight of the font used in the range is bold, if the range is centered, or if the range of text is displayed in a font size that is greater than the font associated with the Normal style in the current document, the range is classified as emphasized. If the font size of the text in the current range is less than the size of the font associated with the Normal style, then the range is classified as de-emphasized. Otherwise, the range is classified as normal. Selected regions are determined by inspecting the object model's selection object, and are sent separately using the selected style.
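The classification rule just described can be expressed compactly as shown below. This is a sketch under stated assumptions: the TextRange record and its field names are hypothetical stand-ins for whatever properties an application's object model exposes, and no real Word API is invoked.

```python
from dataclasses import dataclass

@dataclass
class TextRange:
    """Hypothetical record of the properties extracted for one range."""
    text: str
    font_size: float
    bold: bool = False
    centered: bool = False

def classify_range(rng, normal_size):
    """Classify a range relative to the document's Normal style font size."""
    if rng.bold or rng.centered or rng.font_size > normal_size:
        return "emphasized"
    if rng.font_size < normal_size:
        return "de-emphasized"
    return "normal"

labels = [
    classify_range(TextRange("Metamorphosis", 18.0), normal_size=12.0),
    classify_range(TextRange("see footnote", 9.0), normal_size=12.0),
    classify_range(TextRange("body text", 12.0), normal_size=12.0),
    classify_range(TextRange("small but bold", 9.0, bold=True), normal_size=12.0),
]
```

Because the emphasis test is applied first, a bold range is treated as emphasized even when its font is smaller than the Normal style, which follows the order in which the conditions are stated above.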

If the application software does not easily allow access to such attributes through an API, the application adapters 151 may compute these properties. The process may involve, for example, computing average or modal values for the properties over certain spans of text, and comparing text within the spans to these averages or modes to determine whether the text is emphasized. For example, if the font is larger than the average or modal size computed by the application adapters 151, the text is considered and/or classified as emphasized.
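As one possible illustration of the computed approach, the following sketch takes the modal font size over a span of text as the baseline and flags anything larger as emphasized. The (text, font_size) pair layout and the function names are assumptions of this example.

```python
from collections import Counter

def modal_font_size(spans):
    """Return the most common font size among (text, font_size) pairs."""
    counts = Counter(size for _, size in spans)
    return counts.most_common(1)[0][0]

def emphasized_texts(spans):
    """Return the texts whose font size exceeds the modal size."""
    baseline = modal_font_size(spans)
    return [text for text, size in spans if size > baseline]

spans = [
    ("Life Cycle", 16),
    ("The egg hatches", 11),
    ("into a larva", 11),
    ("which pupates", 11),
    ("fig. 2", 8),
]
emphasized = emphasized_texts(spans)
```

An average could be substituted for the mode; the mode is used here because a few very large headings would otherwise pull the baseline upward.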

The application adapters 151 interpret application-level events so that they can be translated into an event representation the information processing component 152 can process. For example, when a user types text into a document in Microsoft Word®, keyboard events are generated in the application. The Word application adapter interprets these events, paying mind to their target (in this case, the document the user is modifying). The application adapter 151 can then relay a message to the information processing component 152, in this case, indicating the document has changed. The information processing component 152 then queries the application adapter 151, requesting a representation of the updated document. The application adapter 151 then produces a document representation and sends it to the information processing component 152 for analysis.
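The notify-then-request exchange described above might look like the following sketch. The class and method names are hypothetical, the document is reduced to a plain string, and the calls are made synchronously for simplicity.

```python
class InformationProcessor:
    """Stands in for the information processing component."""

    def __init__(self):
        self.latest_representation = None

    def document_changed(self, adapter):
        # On notification, ask the adapter for the updated document.
        self.latest_representation = adapter.document_representation()


class ApplicationAdapter:
    """Stands in for an adapter attached to a word processor."""

    def __init__(self, processor):
        self.processor = processor
        self.text = ""

    def on_key_event(self, char):
        # Interpret the low-level event, then relay a higher-level message.
        self.text += char
        self.processor.document_changed(self)

    def document_representation(self):
        return {"text": self.text, "length": len(self.text)}


processor = InformationProcessor()
adapter = ApplicationAdapter(processor)
for ch in "larva":
    adapter.on_key_event(ch)
```

Separating the change notification from the request for content lets the processing component decide when a fresh representation is worth fetching, for instance by ignoring notifications that arrive in rapid succession.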

In response, the information processing component 152 analyzes the content of the document that the user is manipulating or the actions she performs. Based on the analysis as well as other information, such as attributes of the user and of the task being carried out, and/or other attributes of the application software, the information processing component 152 determines to query one or more information sources 108. The information processing component 152 may produce an internal query representation capable of representing Boolean combinations of terms or quoted phrases. This internal query representation is sent to information source adapters 153 corresponding to the information sources 108 that the information processing component 152 decides to query. Each information source adapter 153 translates the internal query representation into the source-specific query language and/or modifies the content based on attributes of the source, and executes a search. Information source adapters 153 are also responsible for parsing the results of a search into a standard representation, as in metasearch applications.
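A minimal sketch of an internal query representation and its translation by two adapters follows. The Term, And, and Or classes and both output syntaxes are invented for illustration; they do not correspond to the query language of any particular search service.

```python
from dataclasses import dataclass

@dataclass
class Term:
    text: str          # a single word or a multi-word phrase

@dataclass
class And:
    parts: list

@dataclass
class Or:
    parts: list

def render(query, and_op, or_op):
    """Recursively translate the internal query into a string."""
    if isinstance(query, Term):
        # Multi-word terms are emitted as quoted phrases.
        return f'"{query.text}"' if " " in query.text else query.text
    op = and_op if isinstance(query, And) else or_op
    return "(" + op.join(render(p, and_op, or_op) for p in query.parts) + ")"

def keyword_source_adapter(query):
    """Hypothetical source that spells its operators out."""
    return render(query, " AND ", " OR ")

def symbol_source_adapter(query):
    """Hypothetical source using implicit AND and '|' for OR."""
    return render(query, " ", " | ")

query = And([
    Term("caterpillar"),
    Or([Term("metamorphosis"), Term("life cycle")]),
])
spelled_out = keyword_source_adapter(query)
symbolic = symbol_source_adapter(query)
```

The same internal query thus yields a different string per source, which is the role the text assigns to the information source adapters.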

In order to support flexible parameterization and customization of information extraction and keyword selection algorithms used to characterize the context of a user's active task, the information processing component 152 comprises two types of constituent processing components, context analyzers and query producers, both of which are encapsulated and organized using a shared memory in a Blackboard-style system. Both types of constituent information processing components may be executed on the user's machine and/or hosted remotely on a network server responsible for their executions.

The information processing component 152 enables the coordination of constituent processing components through a hierarchical shared memory, or blackboard. Each component has its own thread of control, and can listen to any level of the hierarchical shared memory in order to react when information is added or removed. The architecture supports run-time configurable information processing plug-ins so that the system's functionality can easily adapt to user requirements after deployment.
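The coordination through a hierarchical shared memory can be illustrated with the simplified sketch below. It is an assumption-laden example: levels are modeled as slash-separated paths, listeners are invoked synchronously rather than on their own threads, and removal of entries is omitted.

```python
class Blackboard:
    """Simplified hierarchical shared memory with level-based listeners."""

    def __init__(self):
        self._entries = {}
        self._listeners = {}

    def listen(self, level, callback):
        """Register a callback for a level and everything beneath it."""
        self._listeners.setdefault(level, []).append(callback)

    def post(self, path, value):
        """Store a value and notify listeners on every enclosing level."""
        self._entries[path] = value
        parts = path.split("/")
        for i in range(1, len(parts) + 1):
            level = "/".join(parts[:i])
            for callback in list(self._listeners.get(level, [])):
                callback(path, value)

    def get(self, path):
        return self._entries.get(path)


board = Blackboard()

def query_producer(path, histogram):
    # React to a new context representation by posting the top two words.
    top = sorted(histogram, key=lambda w: (-histogram[w], w))[:2]
    board.post("queries/terms", top)

board.listen("context", query_producer)
# A context analyzer would post its result like this:
board.post("context/histogram", {"caterpillar": 4, "butterfly": 2, "leaf": 1})
```

Neither component calls the other directly; each reads from and writes to the shared memory, which is what allows plug-ins to be added or replaced at run time.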

The first type of constituent information processing components is context analyzers, which produce representations of the user's context. Context analyzers perform their function by analyzing the user's actions within applications, and by analyzing the content of the document the user is manipulating. For example, a context analyzer may produce a word-frequency histogram of the document the user is currently manipulating.

The second type of constituent information processing components is query producers. Query producers are activated when representations of the user's context become available via the shared memory. Query producers are responsible for transforming representations of the user's context into a set of information goals or queries. The query producers may use a histogram of word frequencies in the user's document, presentation information provided by the application adapters and statistical information about a given language to arrive at a ranked list of the most representative words in a given document.
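One way a query producer could combine a word-frequency histogram with language statistics is sketched below. The scoring formula (relative frequency in the document divided by an assumed background frequency), the background table, and the default frequency for unseen words are all assumptions of this example and are not specified in the disclosure; presentation information is left out.

```python
import re
from collections import Counter

def histogram(text):
    """Count lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def representative_words(text, background, top_n=5, default_freq=1e-6):
    """Rank words by how much more frequent they are than in the language."""
    hist = histogram(text)
    total = sum(hist.values())
    score = {
        word: (count / total) / background.get(word, default_freq)
        for word, count in hist.items()
    }
    return sorted(score, key=lambda w: (-score[w], w))[:top_n]

# Hypothetical background frequencies for a given language.
background = {
    "the": 0.05,
    "and": 0.03,
    "caterpillar": 0.00001,
    "butterfly": 0.00001,
}
text = "the caterpillar and the butterfly and the caterpillar"
top_words = representative_words(text, background, top_n=2)
```

Although "the" is the most frequent word in the sample text, its high background frequency pushes it below the topical words.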

FIG. 2 shows a block diagram of an exemplary data processing system that can be used to implement the information retrieval system 100. The data processing system 200 includes a bus 202 or other communication mechanism for communicating information, and a data processor 204 coupled with bus 202 for processing data. Data processing system 200 also includes a main memory 206, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 202 for storing information and instructions to be executed by processor 204. Main memory 206 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by data processor 204. Data processing system 200 further includes a read only memory (ROM) 208 or other static storage device coupled to bus 202 for storing static information and instructions for processor 204. A storage device 210, such as a magnetic disk or optical disk, is provided and coupled to bus 202 for storing information and instructions.

The data processing system 200 may be coupled via bus 202 to a display 212, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to an operator. An input device 214, including alphanumeric and other keys, is coupled to bus 202 for communicating information and command selections to processor 204. Another type of user input device is cursor control 216, such as a mouse, a trackball, or cursor direction keys and the like, for communicating direction information and command selections to processor 204 and for controlling cursor movement on display 212.

The data processing system 200 is controlled in response to processor204 executing one or more sequences of one or more instructionscontained in main memory 206. Such instructions may be read into mainmemory 206 from another machine-readable medium, such as storage device210. Execution of the sequences of instructions contained in main memory206 causes processor 204 to perform the process steps described herein.In alternative embodiments, hard-wired circuitry may be used in place ofor in combination with software instructions to implement thedisclosure. Thus, embodiments of the disclosure are not limited to anyspecific combination of hardware circuitry and software.

The term “machine readable medium” as used herein refers to any mediumthat participates in providing instructions to processor 204 forexecution. Such a medium may take many forms, including but not limitedto, non-volatile media, volatile media, and transmission media.Non-volatile media includes, for example, optical or magnetic disks,such as storage device 210. Volatile media includes dynamic memory, suchas main memory 206. Transmission media includes coaxial cables, copperwire and fiber optics, including the wires that comprise bus 202.Transmission media can also take the form of acoustic or light waves,such as those generated during radio wave and infrared datacommunications.

Common forms of machine readable media include, for example, a floppydisk, a flexible disk, hard disk, magnetic tape, or any other magneticmedium, a CD-ROM, any other optical medium, punch cards, paper tape, anyother physical medium with patterns of holes, a RAM, a PROM, and EPROM,a FLASH-EPROM any other memory chip or cartridge, a carrier wave asdescribed hereinafter, or any other medium from which a data processingsystem can read.

Various forms of machine-readable media may be involved in carrying oneor more sequences of one or more instructions to processor 204 forexecution. For example, the instructions may initially be carried on amagnetic disk of a remote data processing system, such as a server. Theremote data processing system can load the instructions into its dynamicmemory and send the instructions over a telephone line using a modem. Amodem local to data processing system 200 can receive the data on thetelephone line and use an infrared transmitter to convert the data to aninfrared signal. An infrared detector can receive the data carried inthe infrared signal and appropriate circuitry can place the data on bus202. Bus 202 carries the data to main memory 206, from which processor204 retrieves and executes the instructions. The instructions receivedby main memory 206 may optionally be stored on storage device 210 eitherbefore or after execution by processor 204.

Data processing system 200 also includes a communication interface 218coupled to bus 202. Communication interface 218 provides a two-way datacommunication coupling to a network link 220 that is connected to alocal network 222. For example, communication interface 218 may be anintegrated services digital network (ISDN) card or a modem to provide adata communication connection to a corresponding type of telephone line.As another example, communication interface 218 may be a local areanetwork (LAN) card to provide a data communication connection to acompatible LAN. Wireless links may also be implemented. In any suchimplementation, communication interface 218 sends and receiveselectrical, electromagnetic or optical signals that carry digital datastreams representing various types of information.

Network link 220 typically provides data communication through one or more networks to other data devices. For example, network link 220 may provide a connection through local network 222 to a host data processing system 224 or to data equipment operated by an Internet Service Provider (ISP) 226. ISP 226 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the Internet 227. Local network 222 and Internet 227 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 220 and through communication interface 218, which carry the digital data to and from data processing system 200, are exemplary forms of carrier waves transporting the information.

Data processing system 200 can send messages and receive data, includingprogram code, through the network(s), network link 220 and communicationinterface 218. In the Internet example, a server 230 might transmit arequested code for an application program through Internet 227, ISP 226,local network 222 and communication interface 218.

The data processing system 200 also has various signal input/output ports (not shown in the drawing) for connecting to and communicating with peripheral devices, such as a USB port, PS/2 port, serial port, parallel port, IEEE-1394 port, infrared communication port, etc., or other proprietary ports. The data processing system 200 may communicate with other data processing systems via such signal input/output ports.

According to one embodiment, in order to support the awareness of thecontext of the tasks users are performing, and without requiringadditional overhead from the user, the information retrieval system 100is coupled closely with the application software 105. This couplingallows the information retrieval system 100 to become aware of what theuser is doing, and then react to the user by proactively providing herwith access to useful resources by constructing and manipulating lexicalrepresentations of the user's current work product. Theserepresentations are used to access networked resources on behalf of theuser, typically by querying information retrieval systems, that maysupply documents and/or references to other relevant information items.The obtained results may be further processed and organized by theinformation retrieval system for presentation to the user.

Text Analysis

Referring to FIG. 1B, the information processing component 152 receivesinformation related to operations and/or activities of the applicationsoftware 105 from the application adapter 151. Based on the receivedinformation, the information processing component 152 determines thecontent of the active task or tasks. Based on this determination, theinformation processing component 152 generates appropriate queriesrelated to the content to conduct searches in different informationsources 108. For example, the information processing component 152 maygenerate queries for retrieving documents related to subject mattersthat are similar to, or have opposite viewpoints of, those of thedocument that the user is manipulating. The queries are sent to avariety of information sources 108, each of which is accessible to theinformation retrieval system 100 via a corresponding information sourceadapter 153. When search results are returned from the informationsources 108, the information retrieval system 100 may further analyzethe results and present them to the user.

Exemplary processes for analyzing active tasks and generating queries related to the content of active tasks are illustrated using the following examples, in which a user employs Microsoft Word® to review and/or prepare a document related to a specific subject matter. When the user launches Microsoft Word® on her computer, the information retrieval system 100 is also launched, either automatically or manually. An application adapter 151 corresponding to Microsoft Word® is used to interface with Microsoft Word®.

In order to retrieve related documents as the user is writing orbrowsing, the information processing component 152 constructs a query ora set of queries based on the content of the document at hand, and otherinformation as described earlier. The queries produced by theinformation processing component 152 are sent to external informationsources 108 to retrieve related documents.

The text of an active document being manipulated by the user andanalyzed by the information retrieval system 100 includes a plurality oftext items comprising symbols, words, numbers, etc. The text may includespans of text representing a detectable section of a document, such as adocument's title, or other relevant portions of the document asdetermined by the application programming interface (API) accessed bythe application adapter 151 or by other means, including the operatingsystem. Software APIs for accessing or automating existing softwaretypically divide texts into logical portions representing constituentparts. For example, a presentation application software such asMicrosoft PowerPoint® provides methods for determining the boundaries oftexts contained on a single slide, and further divides the texts on aslide into regions such as the title and bullet points. In addition tothis method of determining the natural boundaries of portions of texts,special purpose constituent element detectors may be constructed thatrecognize sequence of tokens, symbols, etc., that comprise a constituentelement, such as a street address.

The text of an active document being manipulated by the user may includeconstituent elements that generally would be excluded from the contextrepresentation or query constructed by the information processingcomponent 152. However, under certain circumstances, the constituentelements may be useful in formulating queries to specific informationsources 108 that require specific query content. For example, aconstituent element such as an address may be required by a mappingengine.

In one embodiment, the information retrieval system applies different rules to eliminate constituent elements based on properties of the active task. The information retrieval system may apply different query transformation rules based on properties of the active task.

The information retrieval system utilizes software components to detectconstituent elements. The software components represent a finite stateautomaton that accepts sequences of characters, character types,symbols, symbol types, strings or sequences of symbols, texts, and thelike. When a constituent element, such as address, signature block,navigation bar, etc., is detected by the finite state automatoncorresponding to a given constituent element type, the informationprocessing component 152 may be caused to perform a specific set ofactions based on a set of transformation rules corresponding to thedetected type of constituent elements, such as eliminating certainconstituent elements, adding additional information related to thedetected constituent elements into the query, and/or selecting aparticular information source 108 based on the detected constituentelements to conduct information search, etc. Details of transformationrules that may be used will be discussed shortly.

For example, a constituent element detector for a signature block in ane-mail might be composed of a finite state automaton accepting sequencesof characters that include the sender's name, followed by sequences ofcharacters matching a pattern representative of phone numbers and/or astreet address. In response to the detection of the specific constituentelement, the information processing component 152 may utilize atransformation rule that excludes the signature block from the textbeing analyzed to form a context characterization or representation,and/or information request.
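The signature block detector described above can be sketched as follows. This is only an illustrative sketch: it uses a regular expression (which is equivalent to a finite state automaton) for the phone number, and the function name `strip_signature`, the `PHONE` pattern, and the assumption that the sender's name sits alone on its own line are hypothetical choices, not details given in the disclosure.

```python
import re

# Hypothetical pattern for North American style phone numbers.
PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")

def strip_signature(body: str, sender_name: str) -> str:
    """Exclude a trailing signature block from the text to be analyzed.

    A signature is assumed to start at the last line consisting only of the
    sender's name, provided a phone-like pattern follows that line.
    """
    lines = body.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip().lower() == sender_name.lower():
            tail = "\n".join(lines[i + 1:])
            if PHONE.search(tail):
                # Transformation rule: drop the detected constituent element.
                return "\n".join(lines[:i]).rstrip()
            break
    return body

email = "Hi Bob,\nSee attached report.\n\nJane Doe\n(312) 555-0100\n"
print(strip_signature(email, "Jane Doe"))
```

A detector for street addresses could be added in the same manner, with its own transformation rule (for example, routing the address to a mapping engine rather than discarding it).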

The information processing component 152 employs a term selectionalgorithm that implements a set of heuristics to eliminate words thatare unlikely to be related to the subject matter of the document thatthe user is manipulating, and to select words that may be indicative ofthe subject of the document. The words included in a document typicallyhave the property of disambiguating each other's meaning. For example,out-of-context, it is equally likely that the phrase, “oracle delphi,”could be referring to a partnership between two software companies, orit could be evoking the place of worship in ancient Greece. However,when the phrase is coupled with other words used in the same context inwhich the above words were drawn, the reader can easily disambiguateamong potential meanings. For example, additional context words like:“software product development information technology,” may indicate thatthe phrase “oracle delphi” may be referring to a partnership between twosoftware companies, while words like: “ancient greece apollo omphalospythia,” may suggest that the phrase “oracle delphi” is referring to theplace of worship in ancient Greece. Thus, these additional words wouldassist in generating more precise queries because they eliminate theambiguity inherent in shorter queries.

The information processing component 152 may be implemented based onanalysis on how documents are structured to convey pragmatic concerns,in particular, the importance of spans of text. In addition, theinformation processing component 152 may leverage as much contextualinformation as possible from the application software 105, includingindications of the user's current focus of attention, and otherinformation, such as attributes related to the user, the applicationsoftware being utilized, the state of the active task, as well as otherconsiderations or additional information, etc.

During operation, documents being manipulated by the user arerepresented to the information processing component 152 as a stream ofwords or punctuation. An exemplary information processing component 152may use one or more of the following word selection heuristics toconvert the words and/or punctuation:

Heuristic 1: Remove stop words:

The information retrieval system 100 maintains, or has access to, a stoplist including the most commonly occurring words that provide littleinformation about the subject of the user's document. Words included inthe stop list are not good search terms because the informationresources 108 will often remove them automatically. The stop list may becreated by a linguistic expert, by an automatic analysis (such asstatistical), or by the user or by a combination of all three. In oneembodiment, the stop list may be stored in the storage device of thedata processing system on which the user is working. The stop list mayalso be retrieved or accessed dynamically during its operation from aremote computer or server when the information retrieval system 100 islaunched. In operation, when the information processing component 152processes texts retrieved from the document that the user ismanipulating, the information processing component 152 will access thestop list and parse through the texts of the document to remove wordslisted in the stop list, unless they are part of a larger recognizedentity. The stop list may be of general use, specific to data sources ordomains of application, or both. Exception lists associated withparticular elimination criteria or other analysis rules may be similarlyconstructed, distributed and accessed.
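A minimal sketch of Heuristic 1 is given below, assuming the document has already been split into words. The stop list contents, the function name `remove_stop_words`, and the way recognized entities are supplied (as a list of multi-word strings) are assumptions made for illustration.

```python
# Hypothetical, deliberately tiny stop list; a real list would be far longer.
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "to"}

def remove_stop_words(tokens, entities=()):
    """Remove stop words unless they are part of a larger recognized entity."""
    lowered = [t.lower() for t in tokens]
    protected = set()  # indexes of tokens inside a recognized entity
    for entity in entities:
        parts = entity.lower().split()
        for i in range(len(lowered) - len(parts) + 1):
            if lowered[i:i + len(parts)] == parts:
                protected.update(range(i, i + len(parts)))
    return [t for i, t in enumerate(tokens)
            if i in protected or t.lower() not in STOP_WORDS]

tokens = "the Bank of America and the economy".split()
print(remove_stop_words(tokens, entities=["Bank of America"]))
```

Here "of" survives inside the recognized entity "Bank of America", while the other stop words are dropped.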

Heuristic 2: Value frequently used words:

The information processing component 152 may calculate a weight for each word contained in the document based on its attributes. For instance, words used frequently are representative of the document's content. Thus, the information processing component 152 dynamically calculates the frequency, or the number of appearances, of each word in the document that the user is manipulating. The information processing component 152 may select a certain number of words to construct queries based on their frequency rankings.
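The frequency ranking in Heuristic 2 can be illustrated with a short sketch. The function name `top_terms` and the case folding are assumptions; the disclosure does not say how case is handled.

```python
from collections import Counter

def top_terms(words, n=20):
    """Return the n most frequent words, most frequent first."""
    counts = Counter(w.lower() for w in words)
    return [word for word, _ in counts.most_common(n)]

words = "mars probe landing mars probe mars".split()
print(top_terms(words, n=2))
```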

Heuristic 3: Value emphasized words:

The application adapter 151 may determine styling attributes for eachword contained in the document and communicate the result to theinformation processing component 152. Styles of words may be anindication of importance of words used in a document. Emphasized wordsare more representative of the document's content than other words.Emphasized words are used in titles, section headings, etc. Informationrelated to styles of words is obtained by the application adapter 151and forwarded to the information processing component 152. Based on theinformation, the information processing component 152 determines whethera word or words are emphasized.

Heuristic 4: Value words that appear near the user's focus of attention:

The information processing component 152 may determine locationattributes for each word contained in the document that the user ismanipulating based on the location of each word in the document, ascommunicated by the application adapter 151. For example, textscontained in a current slide being displayed to the user will be moreindicative of her immediate needs than the text in the rest of thepresentation. If a user selects a region of text in the document, textscontained in the selected region may be given a larger weight than wordsin other sections.

Heuristic 5: Devalue words that appear to be intentionally de-emphasized:

Another styling attribute that the information processing component 152may receive from the application adapter 151 is whether a word isde-emphasized. De-emphasized words are deliberately made smaller by theauthor to make them less distracting or, in some cases (e.g., privacystatements), hard to read. Thus, de-emphasized words, such as words insmall fonts, may be exempt from Heuristic 4.

Heuristic 6: Eliminate words that occur in sections of the document that are not indicative of its content:

Words that occur in the navigation bar of a Web page are only marginally useful, and tend to interfere with other useful text analysis heuristics we use. Likewise, words that occur in a document template (e.g., in a footer that occurs on every page) are not as useful as those that occur in the main body of the document.

The heuristics described above are for illustration purposes only and are not exhaustive. Other heuristics can also be used depending on design preference.

As discussed earlier, documents are represented to the information processing component 152 as a stream of words or punctuation. In operation, words contained in a document usually fall into one of several styles: normal, emphasized, de-emphasized, selected, or list item. As described earlier, the information processing component 152 uses a stop list to tag common words that have little information value (e.g., words like “and,” “or,” and “the”). Punctuation may be kept in order to make the detection of phrase boundaries easier.

Text that appears in a heavier type or in a larger font size draws thereader's attention more than other words, and thus should be entitled toheavier weight in considering relevance to the content of the document.The application adapter 151 compares spans of text in the active taskwith respect to the normal text size and weight for the user's currentdocument, which can vary from document to document. Thus, in order todetermine whether a span of text is emphasized, de-emphasized, ornormal, the application adapter 151 needs to calculate values for thenormal presentation style, such as average height of words contained inthe document being manipulated by the user.

Information related to attributes of words in a document is obtained bythe application adapter 151 corresponding to each application software.For example, attributes of a span of text, such as emphasized,de-emphasized, or normal, may be obtained by detecting the appropriatestructures in HTML documents (for Internet Explorer), or by using theword style properties provided by the Microsoft Office® applications.Each application software has a different set of heuristics tailored tothe typical structure and content of documents created or viewed withinthat application software. In order to compute the normal presentationstyle, the frequency of the line heights and font weights for each spanof text in a document is measured. It is assumed that the line height ofa span of text can be determined through the programming interfaceprovided by the application.

Some application software provides information related to normal styleof the active task. The application adapter 151 may access suchinformation via the application programming interface (API) of theapplication software. Should the API not provide direct access to a“normal” presentation style, the application adapter 151 may compute thefrequency distribution of any given presentation property, such as fontsize or line height. The mode of this frequency distribution (the mostfrequent value of the presentation property) represents the normal valueof this property.

If, however, the application software does not provide information related to attributes of text, the application adapter 151 needs to compute the normal presentation style of the text based on observation of the text. The application adapter 151 may trap low-level operating system calls used to draw text to the screen to determine the required presentation attributes. The normal value of each attribute can be computed in a way similar to computing those provided by the application software.

The application adapter 151 maintains two tables: one maps from line height to frequency, and the other maps from font weight to frequency. For each span of text in the document, the line height and font weight are computed, and their frequencies are incremented. The most frequent line height and font weight are considered the “normal” size and weight. Those spans of text that have line heights above the normal size are classified as emphasized. Those that have line heights below the normal size are classified as de-emphasized.
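The two frequency tables and the resulting classification can be sketched as follows. The input shape (a list of `(text, line_height, font_weight)` tuples) and the function name `classify_spans` are assumptions. As in the paragraph above, only line height drives the classification here; the normal font weight is computed and returned but not otherwise used.

```python
from collections import Counter

def classify_spans(spans):
    """Label each span as emphasized, de-emphasized, or normal.

    spans: list of (text, line_height, font_weight) tuples (assumed shape).
    Returns (normal_height, normal_weight, [(text, style), ...]).
    """
    if not spans:
        return None, None, []
    # Table 1: line height -> frequency. Table 2: font weight -> frequency.
    height_freq = Counter(height for _, height, _ in spans)
    weight_freq = Counter(weight for _, _, weight in spans)
    # The mode of each table is taken as the "normal" value.
    normal_height = height_freq.most_common(1)[0][0]
    normal_weight = weight_freq.most_common(1)[0][0]
    labeled = []
    for text, height, _ in spans:
        if height > normal_height:
            style = "emphasized"
        elif height < normal_height:
            style = "de-emphasized"
        else:
            style = "normal"
        labeled.append((text, style))
    return normal_height, normal_weight, labeled

spans = [("Title", 24, 700), ("body one", 12, 400),
         ("body two", 12, 400), ("fine print", 8, 400)]
print(classify_spans(spans))
```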

The application adapter 151 eliminates background texts contained in thedocument being manipulated by the user. Documents with multiple pagesoften contain text that occurs on every page as part of standard headersor footers. Analysis of the frequency of words in the document wouldresult in the high frequencies being assigned to words that occur inbackground text, even though these highly frequent words may have littleto do with the primary content of the document.

In order to eliminate background or template text, the application adapter 151 eliminates spans of identical text that occur on most pages in the same locations. Spans of text that occur at the same locations are detected by enumerating the list of text spans in a document. It is assumed that bounding rectangles can be determined by examining the document object model within the application software. A table that maps from a string representing the bounding rectangle to a list of strings contained within that rectangle on each page is maintained. A second table that maps from a string representing the bounding rectangle and the string contained within the rectangle to its frequency in that position is also kept. For each span of text, the span is added to the first table if it does not already contain it, and its count is incremented in the second table. Then for each span of text, a span is eliminated if it occurs in the same place more than T times, where T is a tunable parameter related to the number of pages in the document. For example, T = 0.8n, where n is the number of pages in the document.
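A simplified sketch of the template text elimination is shown below. It keeps only the second table (position plus text mapped to frequency), represents each page as a list of `(bounding_rectangle, text)` tuples, and counts a span at most once per page; these representation choices and the name `remove_template_spans` are assumptions.

```python
from collections import Counter

def remove_template_spans(pages, ratio=0.8):
    """Drop spans that repeat at the same position on more than T pages.

    pages: list of pages, each a list of (bounding_rectangle, text) tuples,
    where bounding_rectangle is a hashable tuple (assumed representation).
    T = ratio * n, with n the number of pages.
    """
    threshold = ratio * len(pages)
    counts = Counter()
    for page in pages:
        for key in set(page):  # count each (rectangle, text) once per page
            counts[key] += 1
    return [[span for span in page if counts[span] <= threshold]
            for page in pages]

footer = ((0, 780, 600, 800), "ACME Confidential")
pages = [[((0, 0, 600, 700), f"Body of page {i}"), footer]
         for i in range(1, 6)]
cleaned = remove_template_spans(pages)
print(cleaned[0])
```

With five pages, T is 4, so the footer (present on all five pages) is removed while each page body is kept.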

The information computed by the application adapter 151 is communicated to the information processing component 152. The information processing component 152 then transforms the string of characters it receives into sequences of characters that represent words. The information processing component 152 splits the character string along spaces and carriage returns, and then removes punctuation except for the dash.
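The word splitting step can be sketched in a few lines. This sketch assumes ASCII punctuation only and treats the hyphen-minus character as "the dash"; the name `to_words` is hypothetical.

```python
import string

# All ASCII punctuation except the dash (hyphen-minus) is stripped.
_STRIP = str.maketrans("", "", string.punctuation.replace("-", ""))

def to_words(text):
    """Split on whitespace (spaces, carriage returns, newlines), then
    remove punctuation other than the dash from each chunk."""
    cleaned = (chunk.translate(_STRIP) for chunk in text.split())
    return [word for word in cleaned if word]

print(to_words("State-of-the-art search,\r\nfast (and free)!"))
```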

The information processing component 152 further utilizes several elimination criteria to convert the text and generate a list of the key words and/or phrases and/or other entities representing the user's context. The information processing component 152 eliminates words according to the elimination criteria described above.

The information processing component 152 may use an additional elimination criterion for Web pages to remove words that occur in the host name of a Web site. For example, if the document being manipulated by the user contains the URL www.cnn.com, the information processing component 152 will ignore the term “cnn.”
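One possible way to implement the host name criterion is sketched below using Python's standard `urllib.parse` module. Splitting the host name on dots, and the function names `host_terms` and `drop_host_terms`, are assumptions made for illustration.

```python
from urllib.parse import urlparse

def host_terms(url):
    """Return the dot-separated labels of the URL's host name."""
    if "://" not in url:
        url = "//" + url  # lets urlparse treat a bare host as a network location
    host = urlparse(url).hostname or ""
    return {label for label in host.lower().split(".") if label}

def drop_host_terms(words, url):
    """Remove words that also occur in the host name of the page's URL."""
    blocked = host_terms(url)
    return [w for w in words if w.lower() not in blocked]

print(drop_host_terms(["CNN", "reports", "election", "results"], "www.cnn.com"))
```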

For words surviving the elimination process, the information processing component 152 applies an adaptive weight system to determine the importance of the words based on the attributes of the words. The attributes of the words are obtained by the application adapters 151 as described earlier. The weight for each word is initially set to zero. For each occurrence of a word in a document, the weight is computed by adding an incremental value to the current weight. When a word occurs in a normal style, its weight is incremented by a tunable parameter p1. When it occurs as emphasized, its weight is incremented by a tunable parameter p2. When it occurs as de-emphasized, its weight is incremented by a tunable parameter p3. Typically p2>p1>p3 with p1=1, p2=2, and p3=1/k, where k is the number of terms to be included in the document characterization being constructed. The weights of words that appear in regions selected by the user in the document are incremented by N times the maximum global word frequency, or by some other parameter derived from the active task or its attributes, wherein N is typically equal to 2, in order to ensure that they will appear in the characterization or representation being constructed. Words with a weight above the mean are selected as candidates for inclusion in the query.
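The adaptive weighting can be sketched as follows, using the typical parameter values given above (p1=1, p2=2, p3=1/k, N=2). The input shape (a sequence of `(word, style)` pairs) and the function names are assumptions, and the sketch assumes the selection boost is applied once per selected occurrence, which the text does not state explicitly.

```python
from collections import Counter

def weigh_words(occurrences, k=20, n_boost=2):
    """Accumulate a weight per word from the style of each occurrence.

    occurrences: iterable of (word, style) pairs, where style is one of
    "normal", "emphasized", "de-emphasized", or "selected" (assumed shape).
    """
    occurrences = list(occurrences)
    increments = {"normal": 1.0, "emphasized": 2.0, "de-emphasized": 1.0 / k}
    freq = Counter(word for word, _ in occurrences)
    max_freq = max(freq.values(), default=0)  # maximum global word frequency
    weights = Counter()
    for word, style in occurrences:
        if style == "selected":
            # Assumption: boost applied for each selected occurrence.
            weights[word] += n_boost * max_freq
        else:
            weights[word] += increments[style]
    return weights

def above_mean(weights):
    """Return words whose weight exceeds the mean weight, heaviest first."""
    if not weights:
        return []
    mean = sum(weights.values()) / len(weights)
    return [word for word, w in weights.most_common() if w > mean]

occurrences = [("mars", "emphasized"), ("mars", "normal"),
               ("probe", "normal"), ("probe", "normal"), ("probe", "normal"),
               ("privacy", "de-emphasized"), ("life", "selected")]
weights = weigh_words(occurrences)
print(weights, above_mean(weights))
```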

The attributes described in the adaptive weight system are for illustrative purposes only, and are not intended to be exhaustive. Other attributes and weights can also be utilized to assign weights to different words.

Query Formation and Transformation

After the information processing component 152 extracts words relevant to the content of the document being manipulated by the user, the information processing component 152 performs a query formation process to generate queries to retrieve documents related to the document being manipulated by the user. The information processing component 152 uses the top n words from the ordering resulting from the weighting described above to form the queries. Typically n=20. Other values of n can be used depending on design preference or, for example, on properties of the information source being queried.

The information processing component 152 reorders the terms in the query so that they occur in a meaningful order, as they naturally occur in the document. For example, if the user is writing a paper about the NBA player Michael Jordan, the phrase “Michael Jordan” appears frequently in the paper. Thus, a meaningful search query should use the combined terms “Michael Jordan” to conduct an information search rather than using “Michael” and “Jordan” separately, or in the wrong order, or separated by other terms, all of which would reduce its utility in a query for many searchable data sources.

In order to determine which word goes with which, the information processing component 152 constructs, during the initial analysis phase described earlier, a table mapping from each word to the terms that occur next to it and their frequencies of occurrence. The terms in the query are then reordered according to the table. For each term in the query, those terms occurring after it in sequence a number of times equal to or greater than the mean frequency are considered required next terms, if they also occur in the query. This process is repeated for each required next term, until one or more sequences of one to k terms are generated, where k is a natural number assigned based on system design preference. In one embodiment, these sequences are placed at the beginning of a query. This query is then used to retrieve information from the information sources 108. For instance, in the paper, the term “Jordan” appears after the term “Michael” frequently. The information processing component 152 thus will consider “Jordan” a required term after “Michael.” Thus, even if “Michael” and “Jordan” have different weights after the initial text processing procedure, the information processing component 152 is able to generate a combined search term “Michael Jordan.”
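The next-term table and the chaining of required next terms might look like the sketch below. The text does not say which mean is meant by "the mean frequency"; this sketch assumes the mean over all adjacent word pair counts in the document. The function name `build_phrases`, the default k=3, and the tie-breaking rule are also assumptions.

```python
from collections import Counter, defaultdict

def build_phrases(tokens, query_terms, k=3):
    """Chain query terms into sequences of up to k terms.

    A term is a required next term if it follows the previous term at least
    as often as the mean adjacent-pair frequency (assumed interpretation)
    and it also occurs in the query.
    """
    followers = defaultdict(Counter)  # word -> next word -> frequency
    for current, following in zip(tokens, tokens[1:]):
        followers[current][following] += 1
    pair_counts = [c for nxt in followers.values() for c in nxt.values()]
    if not pair_counts:
        return []
    mean = sum(pair_counts) / len(pair_counts)
    query = set(query_terms)
    phrases = []
    for term in query_terms:
        phrase = [term]
        while len(phrase) < k:
            candidates = [(count, word)
                          for word, count in followers[phrase[-1]].items()
                          if word in query and word not in phrase
                          and count >= mean]
            if not candidates:
                break
            phrase.append(max(candidates)[1])  # most frequent follower wins
        if len(phrase) > 1:
            phrases.append(" ".join(phrase))
    return phrases

tokens = ("michael jordan scored michael jordan won "
          "jordan retired michael jordan returned").split()
print(build_phrases(tokens, ["jordan", "michael", "retired"]))
```

Although "jordan" comes first in the weighted term list, the table shows it follows "michael" far more often than any other pairing, so the combined term "michael jordan" is produced.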

In addition to the queries generated by the information retrieval system100, the user may manually submit a query to initiate an informationsearch process. According to one embodiment, the information processingcomponent 152 directly submits the query to various information sources108. According to another embodiment, in response to a query submittedby the user, the information processing component 152 incorporates theuser-generated query or queries into the query generated by theinformation processing component 152 by concatenating the query terms inthe user's query and the previously constructed contextual query to forma single query. In this way, the information processing component 152brings the previously gathered information about the context of theuser's work to bear directly on the process of servicing a user'sexplicit query.

For example, when a user is viewing a page about NASA's latest Mars probe and enters the query “life,” the information processing component 152 dynamically formulates queries that combine terms related to the probe with the term “life.” Thus, the information processing component 152 is able to retrieve a list of pages about life on Mars, rather than pages about the magazine, the game, the algorithm, or the biological definition, which would usually come up using the single term “life.” Because the information processing component 152 grounds explicit queries in the context of the current document, the results returned are coherent, even for this highly ambiguous query.

The information retrieval system 100 may utilize additional techniquesto refine or transform queries generated based on the content of thedocument being manipulated by the user. In one embodiment, querytransformations comprise two processes: (1) transformation activationand (2) application of transformation rules. Query transformations maybe activated either automatically in response to certain conditionsbeing met, or manually by user control. The transformation rulesutilized for conducting query transformation may be dependent on thetype of condition that activates the transformation. The transformationrules may affect the contents of queries by adding, deleting,substituting, and/or transposing texts in the information query.Additionally, the transformation rules may also affect the informationsources 108 to be queried when certain activation criteria are met.

The information retrieval system 100 may alter the query based on preset conditions or profiles. For example, the information processing component 152 may generate a refined query by adding information related to attributes of the user, such as the user's occupation, position in a company, major in school, instruction, etc. For example, if the user is a teacher, one or more terms related to teaching, education, school, etc. may be added to the query, such as “curriculum,” “syllabi,” “class schedule,” etc. According to another example, as a search condition, the user may designate a certain number of keywords that must be used in searching information. Thus, when the information processing component 152 formulates search terms, the user-set conditions are accessed and additional terms are added to the query according to the conditions. In addition, the information resources used to retrieve information may be selected based on the user profile or the preset condition. For example, if the user is a teacher, databases related to education are searched. If the user prefers certain databases or search engines, the user may set these conditions to refine the search result. Additionally, the system may automatically select appropriate data sources based on properties of the active task, attributes related to the user, the application software being utilized, the state of the active task, as well as other considerations or additional information, etc.

Queries may be further modified using pre-defined query transformation rules. The information processing component 152 may apply the query transformation rules to substitute or alter words contained in the query. For example, the information processing component 152 can access a substitution table in which each entry includes an antecedent and a consequent. The antecedent contains the word that will be substituted, while the consequent contains the words with which the antecedent will be replaced. Each query transformation rule is applied by matching the antecedent against each word in the query and, when a match occurs, substituting the word matching the antecedent with its consequent.

For example, counter arguments related to the same subject matter may be important to the user. Thus, knowledge of opposing experts in particular domains is important to the user's active tasks. For instance, when the user cites Karl Marx's idea of an ideal economic state, the information retrieval system 100 will retrieve two sets of articles: one set representing Marx's point of view, and another set representing Adam Smith's opinion. In order to retrieve documents or information related to the counter arguments, the information processing component 152 may apply a set of query transformation rules as follows:

RULE(ruleset1): SUBST karl marx/adam smith
RULE(ruleset1): SUBST adam smith/karl marx
RULE(ruleset1): SUBST marx/smith
RULE(ruleset1): SUBST smith/marx
RULE(ruleset2): SUBST capitalism/communism
RULE(ruleset2): SUBST communism/capitalism
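By way of illustration only, the substitution rules listed above could be applied as in the following minimal Python sketch. The names RULESETS and transform_query, the lower-casing of the query, and the choice to try multi-word antecedents before single-word ones are assumptions made for this example and are not taken from the disclosure. The query is scanned in a single pass so that a word produced by one rule is not swapped back by its inverse rule.

```python
# Hypothetical rule table mirroring the SUBST rules listed above:
# each entry is (antecedent, consequent).
RULESETS = {
    "ruleset1": [
        ("karl marx", "adam smith"),
        ("adam smith", "karl marx"),
        ("marx", "smith"),
        ("smith", "marx"),
    ],
    "ruleset2": [
        ("capitalism", "communism"),
        ("communism", "capitalism"),
    ],
}


def transform_query(query, active_rulesets):
    """Replace each matching antecedent in the query with its consequent."""
    rules = [rule for name in active_rulesets for rule in RULESETS[name]]
    # Assumption: longer (multi-word) antecedents take priority.
    rules.sort(key=lambda rule: len(rule[0].split()), reverse=True)

    tokens = query.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        for antecedent, consequent in rules:
            words = antecedent.split()
            if tokens[i:i + len(words)] == words:
                out.extend(consequent.split())
                i += len(words)  # skip past the matched antecedent
                break
        else:
            out.append(tokens[i])  # no rule matched this token
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(transform_query("Karl Marx ideal economic state", ["ruleset1"]))
```

Running the original query together with its transformed counterpart would yield the two sets of articles described above, one for each point of view.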

Each of the query transformation rules may be triggered by the same or different activation criteria. Query transformation rules are grouped into sets of rules. The rules are activated and applied when the rule set's activation criteria are met. Information extraction routines are similarly grouped and associated with activation criteria. Information extraction routines detect special-purpose, structured information within the document the user is manipulating. Exemplary activation criteria may include:

(1) Properties of the text or content being manipulated by the user as communicated to the information processing component 152 by the application adapter 151 including, for instance, the genre of the text, such as contract, research report, proposal, etc., or the subject matter of the text, such as biology, engineering, literature, etc., both as determined by the occurrence of specific lexicalizations representing a certain concept, words, or other symbols. For example, if the content describes a specific event, then information extraction routines that recognize the principal actors in the event, the event's location, the duration, the occasion, etc., will be activated. Search terms related to these additional traits will be added to the query. According to another example, the user may input such indication manually.

The genre or type of the active task that the user is manipulating may be determined based on the type of application software being employed to perform the active task. For example, an email application displays and allows the user to compose email. In addition, the application software may provide additional information related to the active task based on the working environment used in performing the active task. For instance, some application software provides document templates, such as resume, customer inquiries, training manuals, etc. Thus, the information retrieval system may obtain or determine the genre or type of the active task based on the type of templates being used by the user.

The content of the text that the user is manipulating may contain words, phrases, symbols or other properties that may be used to determine the type of document. For example, a scientific research paper generally includes sections such as “Abstract” and “References.” The detection of such terms in a document may serve as an indicator that the document is a research paper.

In addition, other properties of the text that the user is manipulating may be computed by inspecting its contents. For example, the reading level of the text can be computed as a function of sentence length, average word length, etc. In addition, other properties such as the level of detail of the text can be computed by examining the specificity of the language used in its contents.

(2) Properties of the user and the user's position or role within an organization. For example, if the user is a salesman in an engineering company, technical jargon in the query could be translated into more straightforward language, allowing the system to retrieve documents more comprehensible to the user. The information related to the user may be stored in a file that is accessible by the information retrieval system. The identity of the user may be determined during a log-in process.

(3) Properties of the data source to be interrogated by the information source adapters 153 on behalf of the information processing component 152. These properties may include, for example, a list of texts or types of texts that comprise valid queries to the data source, or additionally a list of texts or types of texts that should be excluded from inclusion in queries sent to the data source. For example, if the data source is external to an organization, internal product names can be translated into their market equivalents using the kind of transformation rule described above.

(4) Properties of the application software being manipulated by the user as detected by the application adapter 151 and communicated to the information processing component 152. For example, if the user is editing a document that uses the “Resume” template, the system can activate custom content analysis routines. In another example, if the application software used is an e-mail application software, a predetermined set of transformation rules corresponding to e-mail application software may be activated. For instance, a transformation rule may be applied to eliminate specific portions of text such as the signature or salutation.

(5) The current state of the active task in which the user is currently engaged as determined either automatically by the application adapter 151 or by other means, such as a user input indicating her current step in a multi-step task. For example, the information retrieval system may determine that the active task is related to an on-line shopping process based on price tags, shopping cart icons, etc. In response, the information retrieval system may provide several clickable icons including “product survey,” “dealers,” “coupons,” etc. to solicit inputs from the user related to the stage of the online shopping process, so that the information retrieval system may transform search queries based on the different stages of the shopping process and retrieve information related thereto.
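The grouping of transformation rules into rule sets that fire only when their activation criteria are met may be sketched as follows. This is a simplified illustration under stated assumptions: the TaskContext fields, the RuleSet structure, the two example rule sets and the word lists they use are all hypothetical and are chosen only to mirror the e-mail and teacher examples given above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskContext:
    """Hypothetical summary of the activation-relevant properties."""
    genre: str = ""
    user_role: str = ""
    data_source: str = ""
    application: str = ""
    task_state: str = ""


@dataclass
class RuleSet:
    name: str
    is_active: Callable[[TaskContext], bool]   # activation criteria
    apply: Callable[[List[str]], List[str]]    # transformation rules


def strip_email_boilerplate(terms):
    # Assumed list of salutation/signature words, for illustration only.
    boilerplate = {"dear", "regards", "sincerely"}
    return [t for t in terms if t.lower() not in boilerplate]


def add_teaching_terms(terms):
    return terms + ["curriculum", "syllabi"]


RULE_SETS = [
    RuleSet("email-cleanup",
            lambda ctx: ctx.application == "email",
            strip_email_boilerplate),
    RuleSet("teacher-profile",
            lambda ctx: ctx.user_role == "teacher",
            add_teaching_terms),
]


def run_transformations(terms, context):
    """Apply, in order, every rule set whose activation criteria are met."""
    for rule_set in RULE_SETS:
        if rule_set.is_active(context):
            terms = rule_set.apply(terms)
    return terms


if __name__ == "__main__":
    ctx = TaskContext(application="email", user_role="teacher")
    print(run_transformations(["Dear", "budget", "Regards"], ctx))
```

Keeping the activation test separate from the rules themselves allows the same rule set to be triggered by different criteria, consistent with the description above.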

Users may also be presented with means for selecting and/or designating query transformations along one or more specific traits. For example, a GUI (Graphical User Interface) with clickable buttons, menus, etc. may be presented to the user. The clickable buttons or menus comprise selections of transformation traits, such as “economy,” “law,” “where to buy,” “people,” etc. The user may activate one or more selections by clicking the corresponding buttons. In response, the information retrieval system will add phrases and/or key words related to the selected traits. For instance, if a student is working on a paper discussing Iraq, the student may click the “economy” button when she needs information related to the Iraqi economy. In response, the information retrieval system will add additional keywords, phrases or other entities that are related to the concept “economy,” such as “growth rate,” “recession,” “currency,” “exchange rate,” “export,” “import,” “GDP,” etc. to the search query constructed based on the student's active task. Furthermore, in response to the “economy” button being selected, the information sources 108 on which an information search will be conducted may include sources related to economy or financial services, such as databases of Wall Street Journal or Financial Times.
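As a non-limiting illustration, the effect of selecting a trait button can be sketched as a lookup that appends trait-related terms to the query and adds trait-related sources. The tables TRAIT_TERMS and TRAIT_SOURCES and the function expand_query are hypothetical names introduced for this example; their contents are taken from the “economy” example above.

```python
# Hypothetical lookup tables populated from the "economy" example above.
TRAIT_TERMS = {
    "economy": ["growth rate", "recession", "currency", "exchange rate",
                "export", "import", "GDP"],
}
TRAIT_SOURCES = {
    "economy": ["Wall Street Journal", "Financial Times"],
}


def expand_query(base_terms, selected_traits):
    """Return (terms, sources) after applying the selected traits."""
    terms = list(base_terms)
    sources = []
    for trait in selected_traits:
        for term in TRAIT_TERMS.get(trait, []):
            if term not in terms:       # avoid duplicate search terms
                terms.append(term)
        for source in TRAIT_SOURCES.get(trait, []):
            if source not in sources:   # avoid duplicate sources
                sources.append(source)
    return terms, sources


if __name__ == "__main__":
    terms, sources = expand_query(["Iraq"], ["economy"])
    print(terms)
    print(sources)
```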

Information Source Customization

The information retrieval system 100 allows users to customize which information resources should be included in searches conducted by the information retrieval system 100. For example, users are able to create a wrapper that allows the information retrieval system 100 to search a specific Web site or database. This could easily be accomplished by automatically producing a wrapper that uses the site search functionality currently made available by Internet search engines to the public. Search engines typically support functionality that allows users to restrict their search to a given Internet Web site, using syntax like “site:www.xxxx.com” or “host:www.xxxx.com.” Private Web sites, e.g., sites hosted behind a corporate firewall, must be searched via a different mechanism supported by an indexing server. The information retrieval system 100 may be configured to couple to index servers to access private Web sites in addition to those currently indexed by public services. Furthermore, the information sources 108 may include local storage devices residing in the data processing system on which the user is working, or data storage devices coupling to the data processing system on which the user is working via a local data transmission network like an intranet or LAN.
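A wrapper of the kind described above may, for example, simply append a site restriction to the generated query before handing it to a public search engine. The following sketch is illustrative only: the function names and the endpoint https://search.example.com/search are placeholders assumed for this example, and the operator (“site” or “host”) depends on the search engine actually used.

```python
from urllib.parse import urlencode


def site_restricted_query(terms, site, operator="site"):
    """Join the query terms and append a site restriction such as site:www.xxxx.com."""
    return " ".join(terms) + f" {operator}:{site}"


def build_search_url(terms, site,
                     endpoint="https://search.example.com/search",
                     operator="site"):
    """Build a request URL for a placeholder search endpoint (an assumption)."""
    query = site_restricted_query(terms, site, operator)
    return endpoint + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(site_restricted_query(["iraq", "economy"], "www.xxxx.com"))
    print(build_search_url(["iraq", "economy"], "www.xxxx.com"))
```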

Furthermore, the information retrieval system 100 may select information sources 108 based on attributes related to the user, such as the user's occupation, position in a company, major in school, etc., as well as properties of the active task, application software, etc. As discussed earlier, the information processing component 152 may access a user profile to retrieve information related to the user. If the user is a biology teacher, the information retrieval system 100 may restrict or extend the search to information sources that relate to biology, education, etc.

The information retrieval system 100 also recognizes opportunities to provide assistance to the user by completing queries to special-purpose information repositories. The information processing component 152 has a facility for detecting standard textual entities (such as addresses or company names) and providing the user with an interface to useful special-purpose information resulting from a query to specific kinds of online information sources. In order to detect conceptual units for special purpose search, the information processing component 152 runs an array of simple detectors in parallel. Each detector is a finite state automaton accepting a sequence of tokens representing a conceptual unit. When a conceptual unit is detected, the information processing component 152 may present the user with a common action for the item, for example, in the form of a button they can press. For example, when the information processing component 152 detects an address, it presents a button which, when pressed, will display a web page with a map for that address using an automated map generation service. Such information may also be provided automatically. The information processing component 152 also detects opportunities for retrieving special-purpose, structured information in the context of document composition. For example, when a user inserts a caption with no image to fill it in their Microsoft Word® document, the information processing component 152 uses the words in the caption to form a query to an image search engine. Users can then drag and drop the images presented directly into their document. This analysis of actions is also performed using an array of simple application-specific detectors running in each application adapter.
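One possible form of such a detector is sketched below as a small finite state automaton over tokens. The address pattern assumed here (a number, one or more capitalized words, then a street suffix), the suffix list, and the names detect_addresses, DETECTORS and run_detectors are simplifications invented for this example. The detectors are invoked one after another in this sketch, whereas an actual embodiment may run them in parallel as described above.

```python
# Assumed street suffixes; a real detector would need a fuller list.
STREET_SUFFIXES = {"street", "st", "avenue", "ave", "road", "rd",
                   "boulevard", "blvd"}


def detect_addresses(tokens):
    """Return (start, end) token spans accepted by the address automaton.

    States: START -> NUMBER -> NAME -> (accept on street suffix).
    """
    spans = []
    state, start = "START", 0
    for i, tok in enumerate(tokens):
        word = tok.strip(".,").lower()
        if state == "START":
            if word.isdigit():
                state, start = "NUMBER", i
        else:  # state is NUMBER or NAME
            if state == "NAME" and word in STREET_SUFFIXES:
                spans.append((start, i + 1))   # accepting transition
                state = "START"
            elif tok[:1].isupper():
                state = "NAME"                 # capitalized street-name word
            elif word.isdigit():
                state, start = "NUMBER", i     # restart on a new number
            else:
                state = "START"                # pattern broken
    return spans


# Hypothetical registry of detectors, one per kind of conceptual unit.
DETECTORS = {"address": detect_addresses}


def run_detectors(text):
    """Run every registered detector over the whitespace-split text."""
    tokens = text.split()
    return {name: [" ".join(tokens[s:e]) for s, e in detect(tokens)]
            for name, detect in DETECTORS.items()}


if __name__ == "__main__":
    print(run_detectors("Meet me at 1600 Pennsylvania Avenue tomorrow"))
```

Each detected span could then be bound to a common action, such as the map button described above.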

Result Clustering

After the information processing component 152 generates a query, the query is sent to information sources 108 to retrieve information based on the query. The information sources 108 will then return results related to the query. Because the results returned from the information sources 108 often contain copies of the same page or similar pages from the same server or set of mirrored servers, the information processing component 152 may filter the results to eliminate redundant information. In one embodiment, the information processing component 152 collects search results and clusters similar pages. Only a single representative from each cluster is displayed to the user. In general, the system may further process and organize the results of searches in order to optimize their presentation to the user.

In one embodiment, the information processing component 152 clusters redundant results based on the document's title and its URL. The information processing component 152 employs the following heuristic similarity metrics for each of these pieces of information:

Heuristic 1: Title similarity. Two titles are similar if they overlap significantly (e.g., they share common sub-strings). The certainty of similarity increases as a function of the square of the length of the title in words.

Heuristic 2: URL similarity. Two URLs are similar if they have the same internal directory structure. The certainty of similarity increases proportionally as a function of the square of the length of the URL in directory units.

More specifically, suppose two documents have titles T1[1 . . . n] and T2[1 . . . m], where each array element is a character. Let maxSubStr(T1, T2) be the maximum subsequence of T1 that occurs in T2. Then the similarity of T1 and T2 is defined as length(maxSubStr(T1, T2))/max{length(T1), length(T2)}. If documents have URLs U1[1 . . . n′] and U2[1 . . . m′], where each array element is a URL directory unit, the same similarity metric can be applied.

When a new response arrives from the network, it is immediately processed, and the resulting list of suggestions is updated and presented.
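The similarity metric defined above, length(maxSubStr(T1, T2))/max{length(T1), length(T2)}, may be sketched in Python as follows. Several points in this sketch are assumptions rather than features of the disclosure: maxSubStr is interpreted as the longest contiguous run shared by the two sequences; the host name is dropped from each URL so that mirrored servers compare by directory structure only; two results are treated as redundant when both the title and the URL similarity reach a threshold of 0.8; and the length-dependent certainty of Heuristics 1 and 2 is not modeled.

```python
def longest_common_run(a, b):
    """Length of the longest contiguous run of elements shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the run ending at i-1, j-1
                best = max(best, cur[j])
        prev = cur
    return best


def similarity(a, b):
    """length(maxSubStr(a, b)) / max(length(a), length(b))."""
    if not a or not b:
        return 0.0
    return longest_common_run(a, b) / max(len(a), len(b))


def url_units(url):
    """Split a URL into directory units, dropping scheme and host (assumption)."""
    rest = url.split("://", 1)[-1]
    parts = [unit for unit in rest.split("/") if unit]
    return parts[1:]


def cluster_results(results, threshold=0.8):
    """Keep one representative (title, url) per group of redundant results."""
    representatives = []
    for title, url in results:
        for rep_title, rep_url in representatives:
            if (similarity(title, rep_title) >= threshold and
                    similarity(url_units(url), url_units(rep_url)) >= threshold):
                break                      # redundant with an existing cluster
        else:
            representatives.append((title, url))
    return representatives


if __name__ == "__main__":
    results = [
        ("User Guide", "http://a.example.com/docs/guide/index.html"),
        ("User Guide", "http://mirror.example.org/docs/guide/index.html"),
        ("Pricing", "http://a.example.com/pricing"),
    ]
    for title, url in cluster_results(results):
        print(title, url)
```

Because each incoming result is compared only against the current representatives, the list can be updated incrementally as new responses arrive.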

After the clustering process, the information processing component 152 may make the search results visible to the user, for example by displaying a list of the search results in a window next to the one in which the user is working.

The disclosure has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure. The concepts described in the disclosure can apply to various operations of the networked presentation system without departing from the concepts. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

1-9. (canceled)
10. A method for obtaining representative text items from a plurality of text items in an active task, comprising the steps of: (a) identifying attributes for each of the plurality of text items; and (b) excluding text items having at least one of the attributes consisting of: containing less than n letters or characters, wherein n is a tunable number, unless the text items are part of an exception list or part of recognized constituent items; containing all numbers, unless the text items are part of an exception list or part of recognized constituent items; part of a stop list; part of a stop list including text items corresponding to a specific user; part of a stop list including text items corresponding to an information source; and part of a link to retrieve a file or a web page.
11. A machine-readable medium bearing instructions for obtaining representative text items from a plurality of text items in an active task, the instructions upon execution by a data processing system causing the data processing system to perform the steps of: (a) identifying attributes for each of the plurality of text items; and (b) excluding text items having at least one of the attributes consisting of: containing less than n letters or characters, wherein n is a tunable number, unless the text items are part of an exception list or part of recognized constituent items; containing all numbers, unless the text items are part of an exception list or part of recognized constituent items; part of a stop list; part of a stop list including text items corresponding to a specific user; part of a stop list including text items corresponding to an information source; and part of a link to retrieve a file or a web page.
12-26. (canceled)
27. A method for obtaining representative text items based on a plurality of text items of an active task, comprising the steps of: (a) determining properties of the active task; (b) identifying attributes for each of the plurality of text items; and (c) based on the properties of the active task, excluding text items by applying at least one exclusion rule.
28. The method of claim 27, wherein the at least one exclusion rule excludes text items having at least one of the attributes consisting of: containing less than n letters or characters, wherein n is a tunable number, unless the text items are part of an exception list or part of recognized constituent items; containing all numbers, unless the text items are part of an exception list or part of recognized constituent items; part of a stop list; part of a stop list including text items corresponding to a specific user; part of a stop list including text items corresponding to an information source; and part of a link to retrieve a file or a web page.
29. The method of claim 27, wherein the properties of the active task include at least one of application software being employed to perform the active task, the type or genre of the active task, attributes related to the user manipulating the active task, properties of an information source on which a search will be conducted, and the state of the active task.